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Structured additive regression provides a general framework for complex 
Gaussian and non-Gaussian regression models, with predictors comprising 
arbitrary combinations of nonlinear functions and surfaces, spatial effects, 
varying coefficients, random effects and further regression terms. The large 
flexibility of structured additive regression makes function selection a chal- 
lenging and important task, aiming at (1) selecting the relevant covariates, (2) 
choosing an appropriate and parsimonious representation of the impact of co- 
^ variates on the predictor and (3) determining the required interactions. We 

propose a spike-and-slab prior structure for function selection that allows to 

O 

D include or exclude single coefficients as well as blocks of coefficients repre- 

^ senting specific model terms. A novel multiplicative parameter expansion is 

,——1 required to obtain good mixing and convergence properties in a Markov chain 

^ Monte Carlo simulation approach and is shown to induce desirable shrink- 

-4-i age properties. In simulation studies and with (real) benchmark classification 

c/o data, we investigate sensitivity to hyperparameter settings and compare per- 

^ formance to competitors. The flexibility and applicability of our approach are 

> 

demonstrated in an additive piecewise exponential model with time-varying 

in 

Cn| effects for right-censored survival times of intensive care patients with sepsis. 

in 

Geoadditive and additive mixed logit model applications are discussed in an 

o 

^ extensive appendix. 

Key-words: parameter expansion, penalized splines, stochastic search variable selection, general- 

> 

ized additive mixed models, spatial regression 

1. INTRODUCTION 

Recent research on function selection mostly considers the additive model 
y = fi{x\) + . . . -\- fc^{xcj) + e iox Gaussian responses, sometimes including additional lin- 

*Fabian Scheipl is Postdoctoral Fellow (E-mail: fabian.scheipl@stat.uni-muenchen.de) and Ludwig Fahrmeir 
is Professor (emeritus). Department of Statistics, Ludwig-Maximilians-Universitat Miinchen, Munich, 
Germany. Thomas Kneib is Professor, Department of Economics, Georg-August-Universitiit Gottingen, 
Gottingen, Germany. This work was supported by the German Science Foimdation (DFG grant FA 128/5- 

D- 



1 



ear effects or interactions of functions. In this paper we introduce a spike-and-slab prior 
structure to perform Bayesian inference and function selection in structured additive re- 
gression (STAR) models, i.e., in exponential family regression models with additive pre- 
dictors incorporating different types of functions or effects. Compared to additive models, 
we therefore extend function selection to regression with non-Gaussian, in particular dis- 
crete responses and the predictor may contain additional components, such as varying 
coefficient terms uf{x), smooth interactions f{xi,X2), spatial effects /geo(s) for geoaddi- 
tive regression, and cluster-specific random effects. Functions of continuous covariates 
are represented through penalized (tensor product) splines, /geo(s) through (condition- 
ally) Gaussian Markov random fields, and cluster-specific effects through (conditionally) 
Gaussian i.i.d. random effects, but other basic function expansions, surface smoothers, 
and spatial models are possible as well, see e.g., Fahrmeir et al. ( 2004[ | for details. Any 
generalized structured additive regression model can then be written in unifying form as 



ZpSp, 



(1) 



where the conditional expectation E{y\t]) of the response vector y = (j/i, . . .,i/„)' is re- 
lated to a predictor vector // = [rji, ... via a known response function h as in gener- 
alized linear models. The predictor vector tj is additively composed of covariate effects 
/;• = ifji^ji)' ■ ■ ■ 'fji^jn))' of different types. Each effect fj can also be a function of mul- 
tiple covariates and is represented by suitable design matrices Zj of dimension (n x Dj), 
and Dy-dimensional regression coefficients Sj\sj ~ N{0,sjP^) with fixed positive (semi- 
)definite scaled precision matrix Pj and prior variance sj. Singular precision matrices re- 
sult from Bayesian P-Splines ( [Brezger and Lang 2006| l or intrinsic Markov random fields 
dRue and Heldl[2005t . 

Section 2.1 shows how the terms in ([l} can be reparameterized in terms of modified 
design matrices associated with conditionally Gaussian i.i.d. coefficient vectors. This 



reparameterization essentially follows ideas similar to those considered in Ruppert et al. 
( |2003| l or Fahrmeir et al. ( 2004 1 but is especially designed to improve the selection proper- 
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ties of our approach. It also has the advantage that additional function selection questions 
can be handled, for example differentiating between no effect, linear effect and inherently 
nonlinear effect of a continuous covariate (see Section [ZT] for details). 

In function selection, we are interested in finding simple special cases of ([T]), where 
some of the functions are identified as having negligible impact on the response. For 
example, in our geoadditive regression model of rents in the city of Munich, we want to 
select a subset from a large set of categorical covariates and to decide if effects of age of a 
building and floor space of the flat are nonlinear or linear, if an interaction between them 
is necessary, and if a spatial effect in form of a Markov random field representing the 
location of buildings is needed. In the application dealing with the analysis of survival 
times of patients that acquired a septic infection after surgery, we are interested in finding 
a parsimonious model to indicate which covariates have (potentially nonlinear) impact on 
survival and to evaluate the presence or absence of time-varying effects indicating non- 
proportional hazards. 

In a Bayesian or mixed models framework, functional effects in structured additive re- 
gression are reparameterized as i.i.d. Gaussian random effects (or i.i.d. Gaussian priors 
from a Bayesian perspective). This ideas has become quite popular since it allows treat- 
ing complex regression models as mixed models and makes corresponding restricted 
maximum likelihood (REML) estimation available for the determination of smoothing 
Wand ( 2000[ > was among the first to recognize this possibility and introduced 



variances. 



mixed model based inference for penalized spline regression based on truncated power 
series expansions in Gaussian regression models. Ruppert et al. ( |2003 1 provide an in- 
depth overview on mixed model based semiparametric regression and describe exten- 
sions to the estimation of interaction surfaces and spatial effects. In an (empirical) Bayes 
framework, the mixed model representation corresponds to a reparameterisation of the 



prior and REML estimation is interpreted as marginal likelihood estimation, see Fahrmeir 



et al. ( |2004[ > for more details in the context of (generalized) STAR models and Crainiceanu 



et al. ( |2005| for the implementation of such models via WinBUGS. The mixed model per- 



spective on semiparametric regression has also led to the development of likelihood ratio 
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tests that implement the selection of functions fj{xj) based on testing Hq : sj = versus 



Crainiceanu et al. 



2005 



Greven et al. 



2008 



Scheipl et al. 



2008 1, since 



Ha : > (cf. 

sj = implies fj = ZjSj = 0. However, these likelihood ratio tests are so far only appli- 
cable in Gaussian regression models and are not suitable for automatic function selection 
in complex regression models with a large number of potentially nonlinear effects. 

To develop a Bayesian counterpart of likelihood ratio tests, it seems natural to impose 
a bimodal spike-and-slab prior on the variances sj, as suggested in 



Ishwaran and Rao 



( [2005l l for the case of variable selection in high-dimensional linear models, i.e. for select- 
ing single scalar regression coefficients. Indeed, our first attempt to select functions in 
STAR models was based on this simple idea. However, as shown in the web appendix, 
such a straightforward approach is rendered infeasible by the severe convergence and 
mixing problems it causes. Informally, the problem is that a small variance for a block 
of coefficients implies small coefficient values and small coefficient values in turn imply 
a small variance. Therefore, blockwise MCMC samplers are unlikely to exit a basin of 
attraction around zero. We therefore propose a multiplicative parameter expansion for 
the regression coefficients inspired by the work of Gelman et al. ( 2008[ l in the context of 
mixed models. We show that this parameter expansion leads to an efficient MCMC strat- 
egy and to a prior with regularization properties similar to /^^-penalization with < 1 in 
Section 12.31 

The main advantages of our new prior structure can be summarized as follows: First, 
unlike standard stochastic search variable selection approaches, it can routinely be used 
with non-Gaussian responses from the exponential family based on iteratively weighted 
least squares updates. Second, it supports the full generality of STAR models, i.e. it ac- 
commodates all types of regularized effects with a (conditionally) Gaussian prior such 
as simple covariates or covariate blocks (both continuous and categorical), penalized 
splines (uni- or multivariate), spatial effects, random effects or ridge-penalized factors 
and all their interactions (e.g. (space-)varying coefficient terms or random slopes). Third, 
it scales to data sets of intermediate size with thousands of observations and high- 
dimensional predictors including hundreds of model terms. Fourth, it is implemented 
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in publicly available and user-friendly open source software (R-package spikeSlabGAM 
( [Scheipl 2011b (), therefore allowing reproducibility of our results and immediate appli- 
cation to new data sets. 

Due to the practical importance of the topic, there is a vast amount of literature on 
selecting components in predictors of regression models. Most previous work considers 
selection of variables or associated (single) regression coefficients in high-dimensional 
(generalized) linear models. Penalization approaches have become quite popular, in 
particular the Lasso or the SCAD penalty and modifications as the adaptive or group 
Lasso. Another branch is boosting, see Biihlmann and Hothorn| ( 2007| for a survey. Most 
Bayesian approaches for variable selection are based on spike-and-slab priors for regres- 
sion coefficients, see for example the stochastic search variable selection (SSVS) approach 
George and McCulloch] ( |1993| l, among other methods, and the review in [O'Hara and 



m 



SiUanpaa (2009). 



In comparison, research on function selection is more sparse. Recently, [Marra and 



Wood ( 2011[ l have proposed a "double shrinkage" approach for GAMs with an additional 



penalty on the null space of the smoothness penalty which enables shrinking entire func- 
tional terms to zero. Most other penalization methods only consider additive models 
for continuous (Gaussian) responses and perform function selection by penalizing cer- 
tain norms of functional components or associated blocks of basis function coefficients in 



Lasso- or SC AD-type fashion. Lin and Zhang ( 2006 ( proposed the component selection 
and smoothing operator (COSSO) in additive smoothing spline ANOVA models. Mo- 



tivated by the adaptive group Lasso, Storlie et al. (20101 propose the adaptive COSSO 
(ACOSSO) to penalize each functional component differently so that more flexibility is 
obtained. Its superior performance to the COSSO (and MARS) is demonstrated for sim- 



ulated and real data. A similar adaptive group Lasso approach is studied in Huang 



et al. ( 2010[ >. Ravikumar et al.] ( |2009[ > estimate sparse additive models by penalizing the 



quadratic norm of functional predictor terms, Meier et al. ( 2009| additionally impose a 
smoothness penalty. 

Many Bayesian function selection approaches are based on introducing spike-and-slab 
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priors with a point mass at zero directly for blocks of basis function coefficients or, equiv- 
alently, indicator variables for functions being zero or nonzero. [Wood et al.| ( p002l > andfV^ 



et al. ( |2003| > describe implementations using a data-based prior that requires two MCMC 



runs, a pilot run to obtain a data-based prior for the "slab" part and a second one to es- 
timate parameters and select model components. [Panagiotelis and Smith ( 2008| l combine 
this stochastic search variable selection approach with partially improper Gaussian pri- 
ors, as for basis coefficients of Bayesian P-splines, in high-dimensional additive models. 



They suggest several sampling schemes that dominate the scheme in Yau et al. (2003(. A 
more general approach based on double exponential regression models that also allows 
for flexible modeling of the dispersion is described by [Cottet et al. ( |2008| |. They use a 
reduced rank representation of cubic smoothing splines with a very small number of ba- 
sis functions to model the smooth terms in order to reduce the complexity of the fitted 
models, and, presumably, to avoid the mixing problems already mentioned and described 
in the web appendix. Reich et al. ( |2009[ > use the smoothing spline ANOVA framework 
and a spike-and-slab prior for the variance of Gaussian process priors to perform variable 
and function selection via SSVS for Gaussian responses. In a wavelet-based functional 
Gaussian mixed model framework, Morris and CarroII| ( 2006] l place spike and slab pri- 
ors directly on wavelet coefficients to decide whether they are important for representing 
functional effects. This approach is further extended by Zhu et al. ( |2011 > who improve 
robustness and adaptivity by considering scale mixtures of normals in the spike and slab 
specification. Zhu et al. ( |2010[ > develop a probit model for functional data classification 
that allows to select functional predictors. Their hierarchical model is based on a latent 



Gaussian model and on Gaussian process priors for functional effects. Like Reich et al. 



( |2009| l, they place a spike and slab prior on the variance, while the remaining part of the 
Gaussian process covariance matrix is assumed to be known. This (strong) assumption 
and the possibility to integrate out functional effect parameters in full conditionals for 
variance parameters facilitates MCMC inference in this case. 

Function selection in generalized additive models using Bayes factors has recently been 
proposed in two ways: Sabanes Bove and Held ( 2011[ | present a basis function selec- 
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tion approach for Bayesian fractional polynomials for potentially nonlinear effects, while 
the approach of Sabanes Bove et al. ( |2011 > is based on a grid of fixed effective degrees 
of freedom for each penalised spline. Chipman et al. ( |2010[ | propose Bayesian adaptive 
regression trees (BART) to develop a completely non-parametric prediction-oriented ap- 
proach. 

In principle, adaptive smoothing approaches based on knot selection strategies as sug- 
gested for example in Denison et al. ( |1998| > for the Bayesian MARS or Dimatteo et al. 



( |2001| | for adaptive regression splines (BARS) could be considered as additional com- 
petitors for function selection. In particular, knot selection strategies typically involve 
the possibility to deselect covariate effects by excluding all basis functions associated 
with a specific covariate. However, this possibility is usually only a by-product of the 
model specification and is not directly intended for function selection. Roughly, these 
approaches correspond to specifying separate i. i. d. spike and slab priors for the scalar 
basis function coefficients, rather than imposing a multivariate spike and slab prior on the 
entire vector of basis function coefficients as in Panagiotelis and Smith ( |2008| . Note that 
even fitting of a single function with adaptive smoothing based on knot selection can be 
extremely time-consuming, see for example the comparison of BARS with the adaptive 
penalty approach in Krivobokova et al. ( |2008 |. As a consequence, selecting functions in 
this fashion becomes inefficient, if not computationally infeasible. Therefore we will not 
consider knot-selection approaches in the rest of this paper. 



2. NMIG PRIORS FOR FUNCTION SELECTION 

2.1. Generic Parameterization 

Many of the regression terms available for STAR models are associated with conditionally 
Gaussian priors with zero mean and general positive semidefinite precision matrices P 
(cf. ([TJ). For example, in the case of penalized splines the precision matrix represents 
the dependence structure imposed by the random walk prior ( [Brezger and Lang[ 2006 1 



or in the case of Gaussian Markov random fields, the precision matrix is defined by the 
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neighborhood structure underlying the geographical arrangement of the data ( |Rue and 



Held 2005| |. We will now show that such general Gaussian priors can always be recast 



based on i.i.d. priors: 

Assume that J|s^ ~ N(0, s^P^) represents the D-dimensional regression coefficient cor- 
responding to one of the terms /(z) = ZS appearing in a STAR model ([T]). Let K denote 
the dimension of the null-space of P. Since /(z) = ZS and ZS\s^ ~ N{0,s^ZP-Z ), 
the spectral decomposition ZP~Z' = UVU' with orthonormal IT and a diagonal V with 
entries > yields an orthogonal basis representation for the improper prior covariance 
of /(z). For Z with D columns and full column rank and P with rank d = D — K, all 
eigenvalues in V except the first d are zero. Now write ZP'Z' = [U+Uq]' [ V o] i'^+'^ol 
where 11+ is a matrix of eigenvectors associated with the positive eigenvalues in V^, and 
Uo are the eigenvectors associated with the zero eigenvalues. With Xpen = U+V+''^ and 
^pen ~ N(0,c^I), /pen(z) = XpenjSpgj^ has a proper Gaussian distribution that is propor- 



tional to that of the partially improper prior of /(z) (Rue and Held 2005 eq. (3.16)) 
but parameterizes only the penalized component of /(z), while Xq = ITq and the as- 
sociated coefficients )3g parameterize functions /o(z) = Xq/Sq in the null space of P. In 
summary, we reparameterize and decompose /(z) = /o(z) +/pen(z). Our reparameter- 
ization follows similar ideas as in early references on mixed model based inference in 
semiparametric regression (see for example [Wand ( 2000[ l; Ruppert et al. ( |2003 1 for cor- 
responding frequentist approaches and Fahrmeir et al. ( |2004 1; Crainiceanu et al. ( |2005[ ) 
for Bayesian interpretations) but is designed for the special purpose of function selection. 
Basing the decomposition on the spectral decomposition of ZP~Z' instead of the spectral 
decomposition of P yields an orthogonal basis representation and therefore facilitates the 
differentiation between penalized and unpenalized parts of a function. 

In practice, it is unnecessary and impractically slow to compute all n eigenvectors and 
values for a full spectral decomposition MVW' . Only the first d are needed for X, and of 
those the first few typically represent most of the variability in /(z). Our implementation 
makes use of a fast truncated bidiagonalization algorithm ( Baglama and Reichelj 20061 
available in iriba ( [Lewis 2009 1 to compute only the largest d eigenvalues of Cov(/(z)) 
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and their associated eigenvectors. Only the first d eigenvectors and -values whose sum 
represents at least .995 of the sum of all eigenvalues are used to construct the reduced 
rank orthogonal basis Xpen with d columns. E.g. for a cubic P-spline with second order 
difference penalty and 20 basis functions (i.e. D = 20 columns in Z and K = 2), X will 
often have only 8 to 12 columns and Xq has one column for the linear trend since the 
constant part of /o(z) is subsumed into the global intercept to ensure identifiability. 

The advantages of applying a reparameterization are two-fold: First, we can separate 
the penalized part of a predictor function /(z). Second, we can now not only assign an 
i.i.d. Gaussian prior to the penalized part but also to the unpenalized part. Of course 
we then no longer have a one-to-one transformation of the original prior but we can per- 
form function selection on both the penalized and the unpenalized parts. For example, 
in case of penalized splines with A;-th order random walk prior, the space of unpenalized 
functions consists of all polynomials of order less than k — 1. Separating these polyno- 
mials from the non-polynomial, penalized part of the function opens up the possibility 
to decide whether a nonlinear effect for a continuous covariate should be included in the 
model at all, whether it can be described in terms of a linear effect or whether a nonlinear 
effect is needed. The resulting models are more parsimonious and easier to interpret. 

In the following, we assume that the reparameterization has been applied to all relevant 
model terms and we treat /o(z) and/pen (z) as separate model terms, so that »/ = '/q + 
TJ^i ZjSj with Jj|sy ~ N{0,sjPj) (cf. Q) can now be rewritten as 

V = Vo + L^jh (2) 

with t]Q containing a global intercept, offset terms, and effects that are not associated with 
a variable selection prior, l^jlvj ~ N(0, vjl) and p > P the new number of separate model 
terms now comprising both Xq^j and Xpgnj, j = 1, . . . ,p. 
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2.2. Parameter-Expanded NMIG Prior 



Inspired by the work of Gelman et al. ( 2008| >, we propose a multiplicative parameterization 



of and combine it with a spike-and-slab prior based on a mixture of inverse gamma 
distributions for the variances vj. More specifically we multiplicatively expand the dj- 
dimensional vector to /3y = Uj^j, E M.'^', where the scalar parameter Uj parameterizes 
the importance of the j-th coefficient block, while "distributes" ay across the entries in 

We assume that a.j ~ N{0,vj) follows a univariate Gaussian distribution with variance 
vj = ^jjj given by the product of an indicator variable 7^ and the prior variance tJ. In a 
further level of the hierarchy we specify 

rf ~ Y'^{ar,'bj), jj ~ zvSi{jj) + (1 - w)3^,{jj), 

i.e. the variance t? is assumed to follow an inverse gamma-prior with shape parameter 
At and scale parameter br chosen such that br ^ flr- As a consequence, the mode bj/a^ 
is significantly greater than 1. The indicator takes the value 1 with probability w or 
some (very) small value Vq with probability 1 — w. The implied prior for the effective 
variance vj = jjzj' is a bimodal mixture of inverse gamma distributions, with one com- 
ponent strongly concentrated on very small values - the spike with 7^ = vq and effective 
scale parameter Vob-^^ - and a second more diffuse component with most mass on larger 
values - the slab with 7^ = 1 and scale b-c- A coefficient associated with a variance that 
is primarily sampled from the spike-part of the prior will be strongly shrunken towards 
zero if Vq is sufficiently small, so that the posterior probability for jj = vq can be inter- 
preted as the probability of exclusion of /S^ and the corresponding function fj from the 
model. A Beta prior for the mixture weights w can be used to incorporate the analyst's 
prior knowledge about the sparsity of j8 or, more practically, enforce sufficiently sparse 
solutions for overparameterized models. 

We refer to the complete prior structure for ocj as a normal-mixture-of-inverse gamma 
(NMIG) distributions, denoted as aj ~ NMlG{vo,w,aT:,br)- A similar NMIG prior has 
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originally been suggested in |Ishwaran and Rao ( |2005| > for selecting single coefficients 
/6y ~ N{0,vj) in high-dimensional linear models. 

Integrating out rj- and 7y from Uj ~ N{0,vj = ^jtJ), keeping w fixed, the bimodal 
Inverse Gamma prior induces the mixture of two scaled t-distributions 



DCj\w ~ (1 — w)t{df, So) + w t{df,si), 



with df = lUr, So = \/vobj/ar, Si = \/br/ar, as a spike and slab prior directly on ay. 
This may suggest to put other spike and slab priors directly on ocj, in particular a bimodal 
mixture of normals 



(3) 



with Vy » v^j, introduced by George and McCulloch (1993i for selecting scalar coef 



ficients in high dimensional linear models. Another alternative seems to be to replace 
N(0, i/gy) by a point mass at zero, but this works only for Gaussian regression models 
where it is possible to integrate out parameters analytically from full conditionals for 
indicator variables. 

We prefer the NMIG prior for aj, inducing the student spike and slab prior, for several 
reasons: First, as noted in Ishwaran and Rao ( |2005| and supported by our own experience, 
posterior inference and model selection is relatively robust with respect to hyperparame- 
ters in the NMIG hierarchy. It can be more difficult to specify vqj, vy for the Gaussian spike 
and slab prior (|3|. Second, Section 5 of Ishwaran and Rao ( |2005 1 provides theoretical argu- 
ments in favor of a bimodal continuous prior for variances, as in the NMIG and peNMIG 
prior, whereas (|3]l is a bimodal discrete prior (1 — w)I{vj = v^^) + wl{vj = Vy). Further- 
more, the favorable shrinkage properties of the NMIG prior induce desirable shrinkage 
properties for the coefficient vector jSy, see Subsection 



2.3 



Entries of the vector are assigned the prior distribution 



1 1 
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i.e. we assume i.i.d. mixtures of two Gaussian distributions with expectation ±1. Al- 
though the marginal prior for still has zero expectation, the bivariate mixture assigns 
most of the prior mass close to the multiplicative identity (with positive or negative sign). 
This enables the interpretation of (Xj as the "importance" of the 7-th coefficient block and 
yields a marginal prior for that is less concentrated on small absolute values than with 
a standard assumption like ^j]^ ~ N(0, 1). 

The prior specification for j8y is completed by assuming prior independence between 
Kj and ^j. We write )3y ~ TpeNMlG{vo,w,ar,bT) for the complete prior structure also 
summarized as a directed acyclic graph in Figure [l] 

The main advantage of the peNMIG-prior is that the dimension of the coefficient vector 
associated with updating 7^ and tJ- is equal to one in every penalization group, since 
the Markov blankets of both 7^ and Tj only contain the scalar parameter aj instead of 
the vector jSy. This is crucial in order to avoid mixing problems that would arise in 
a conventional NMIG prior without parameter expansion (see web appendix A). The 
vector f = {^[, . . . is decomposed into subvectors ^j, j = l,...,p, associated with 
the different model terms and their respective entries a.j in a. Note that vji typically also 
includes terms that are not under selection, such as known offsets, a global intercept or 
covariate effects that are forced into the model. Their coefficients are associated with 
weakly informative flat Gaussian priors. 

For Gaussian responses, we assume an lG{aa,ba) prior for the variance a^. 

2.3. Shrinkage Properties 

This section describes regularization properties of marginal priors for regression coeffi- 
cients, implied by the hierarchical prior structure described in the previous section and 
visualized in Figure [T] We analyze marginal priors because it is their shape - and less that 
of the conditional priors - that determines the shrinkage properties. 

For comparison with other shrinkage priors recently suggested for pure variable selec- 
tion, i.e. for selecting single scalar regression coefficients rather than blocks of coefficients, 
we first consider univariate marginal priors. Omitting indices ; and the dependence on hy- 
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J// ~ Expo.fam.(g(i7;)) 



i=l,...,n 



Figure 1: Directed acyclic graph of NMIG model with parameter expansion. 
Ellipses are stochastic nodes, rectangles are deterministic /logical nodes. Sin- 
gle arrows are stochastic edges, double arrows are logical/deterministic 
edges. 



perparameters Uj, hr, a-u,, hw and Vq, the marginal prior for a scalar coefficient /3 is obtained 
by integrating out all other random variables appearing in the prior hierarchy, i.e. 



B 1 

p{ci\j, T'^)p{^\m)j^p{m)p{T^)p{'y\w)p{w)da.dmdT'^djdw. 



(4) 



Whereas the marginal prior for the original NMIG spike-and-slab prior of Ishwaran 



and Rao ( 2005 1 can be derived analytically as a mixture of scaled t-distributions, the above 
integral has no known closed form and has to be computed numerically. The marginal 
NMIG prior has a finite spike around zero, corresponding to the first component of the 
scaled t-mixture, and a slab corresponding to the second component. In comparison, the 
marginal peNMIG prior has heavier tails and an infinite spike at zero. Its shape is very 
close to the horseshoe prior which has favorable theoretical properties. The peNMIG 
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prior also looks similar to the original spike-and-slab prior suggested by Mitchell and 



Beauchamp |[1988). The tails of the marginal peNMIG prior are also heavy enough to 



imply redescending score fimctions which ensures Bayesian robustness of the resulting 
shrinkage estimators. The shape of the score function is similar to that of an >C(j-prior 



with (J — > and is fairly robust with respect to hyperparameters, see Figure 5 in Scheipl 

In summary, the peNMIG prior for scalar j6 combines an infinite spike at zero with 
heavy tails. This desirable combination is similar to other shrinkage priors, including the 
horseshoe prior and also the normal-Jeffreys prior ( |Bae and Mallick 2004| |, for which both 
robustness for large values of j5 and very efficient estimation of sparse coefficient vectors 
have been shown (Carvalho et al. 20 10[ [Poison and Scott 20101. 

A main advantage of our peNMIG prior is its generalization to multiple shrinkage of 
blocks of coefficients vectors, both in terms of sampling and shrinkage properties. We il- 
lustrate this for two-dimensional coefficients (j6i,/32), distinguishing two situations: First, 
/6i and 1^2 come from two different coefficient (or penalization) groups, i.e. we have 
f^i = ai^i and /32 = oci^i, with cti independent from a.2, in terms of the reparameterized 
coefficients. Second, j6i and j52 are from the same block of coefficients representing one 
penalization group, so that (j6i,j62) = a(^i,^2)/ with identical cci = a.2 = a. Figure |2] 
shows contour plots of log p(/3i, /52) for these two situations and, for comparison, for the 
original NMIG prior which applies only to the first situation. 

The shape of the constraint region implied by the peNMIG prior (middle of Figure 
[2j| has the convex shape of a >Cq-penalty function with q < 1, which has the desirable 
properties of simultaneous strong shrinkage of small coefficients and weak shrinkage of 
large coefficients due to its closeness to the Cq penalty. 

The contours of the NMIG prior (left part of Figure |2]) have different shapes depending 
on the distance from the origin. Close to the origin (/3 < .3), they are circular and very 
closely spaced, implying strong ridge-type shrinkage: Coefficient values close to zero fall 
into the spike-part of the prior and will be strongly shrunk towards zero. Moving away 
from the origin (.3 < /S < .8), the shape of the contours defining the constraint region 
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Figure 2: Contour plots of lo§p{{j5i, fij)') for flr = 5, bj = 50, vq = 0.005, 
ciw = bii, for (from left to right) the NMIG prior for two coefficients from dif- 
ferent penalization groups, the peNMIG prior for two coefficients from dif- 
ferent penalization groups and the peNMIG prior for two coefficients from 
the same penalization group. 

morphs into a rhombus shape with rour\ded corners that is similar to that produced by 
a Cauchy prior. Still further from the origin (1 < < 2), the contours become convex 
and resemble those of the contours of an £^-penalty function with q < 1. Coefficient 
pairs in this region will be shrunk towards one of the axes, depending on which of their 
maximum likelihood estimators is bigger and their posterior correlation. For larger values 
of coefficient pairs, the contours (not shown in Figure |2| imply ridge-type shrinkage. 

The shape of the constraint region for coefficient pairs (/3i,/52) = a(^i,^2) from the 
same penalization group (right part of Figure |2]) resembles that of a square with rounded 
corners. Compared with the convex shape of the constraint region in the middle, this 
shape induces less shrinkage towards the axes and more towards the origin or along the 
bisecting angle. 
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2.4. Markov Chain Monte Carlo Algorithm 

Posterior inference and function selection is based on a blockwise Metropolis-within- 
Gibbs sampler. The sampler cyclically updates the nodes in Figure [T] For Gaussian 
responses it reduces to a Gibbs sampler. The full conditionals of the parameters w, rj, 
7y, i = 1,. . . ,p, and the means m = {nii, . . . , mi, . . . , m^)' of the conditionally Gaussian 
variables ^|m/ ~ N [m;,!), nii = ±1, are available in closed form and are included in 
Algorithm [T] in the appendix. They do not depend on the specific exponential family 
chosen for the responses. 

The full conditionals for a = {oci, . . .,ocp)' and f = {^[, . . . , ^p)' depend on the "condi- 
tional" design matrices Xa = Xblockdiag(^2, . . . , ^p) and 

= Xdiag(blockdiag(l(ii, . . . ,ldp)ft^)/ respectively, where 1^; is a cf x 1 vector of ones 
and X = (Xi, . . . , Xp) is the concatenated design matrix. For Gaussian responses, the full 
conditionals are given by 

a|- ~ N{fi^,'La) with 

= (\xlXc + diag {yr^) M , jij = ^L^Xly, and 

^ (5) 

^1- ~ Nifi^r-L;^) with 

/I . \-i 



For non-Gaussian responses, we use an MH algorithm with a penalized IWLS proposal 
(P-IWLS) based on an approximation of the current posterior mode, described in detail 
Brezger and Lang| ( |2006| > (Sampling scheme 1, Section 3.1.1). The MH step uses a 



m 



Gaussian (i.e. second order Taylor) approximation around the approximate mode of the 
full conditional as its proposal distribution. To decrease computational complexity, we 
modify the IWLS algorithm by using the mean of the proposal distribution of the previous 
step instead of the posterior mode. Because of the prohibitive computational cost for 
large q and p (and low acceptance rates for non-Gaussian response for high-dimensional 
IWLS proposals), neither a nor ^ are updated all at once. Rather, both a and ^ are 
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split into ba (b^) update blocks that are updated sequentially conditional on the states of 
all other parameters. For efficient and numerically stable draws from the multivariate 
Gaussian densities in (|5| or in the IWLS algorithm, we use QR decompositions of the 
covariance matrices. Alternatively, Cholesky decompositions as in [Lang and Brezger 
( |2004| | or [Brezger and Lang] ( |2006| may be employed. 

After updating the entire a- and ^-vectors, each subvector is rescaled so that | \ has 
mean 1, and the associated ocj is rescaled accordingly so that = OLj^j is unchanged: 

This rescaling is advantageous since ocj and are not identifiable and thus their sampling 
paths can wander off into extreme regions of the parameter space without affecting the 
fit, e.g. oij becoming extremely large while entries in simultaneously become extremely 
small. By rescaling, we retain the interpretation of ocj as a scaling factor representing the 
importance of the model term associated with it and avoid numerical problems that can 
occur for extreme parameter values. 

By default, starting values )S^*^^ are drawn randomly in three steps: First, we do 5 Fisher 
scoring steps with fixed, large hypervariances. Second, for each chain run in parallel, 
Gaussian noise is added to this vector, and third its constituting p subvectors are scaled 
with variance parameters 7yT^^ {j = 1, . . . ,p) drawn from their priors. This means some 
of the p model terms are set close to zero initially, and the remainder is in the vicinity of 
their respective ridge-penalized maximum likelihood estimates. Starting values for a'*^' 
and are then computed via af^ = df I and = ^f/afK 

Function selection, i.e. selection of coefficient blocks ] = 1, . . . , p can be based on the 
posterior inclusion probabilities P(7y = Instead of estimating them simply through 
the proportion of MCMC samples (f ) for which 7j^^ = 1, which may have high sampling 
variance, we improve precision through the Rao-Blackwellized estimate P(7; = = 
T^^ Ylt^ilJ^ — The full conditional probabilities P{'yJ^ = are directly available 
from step 14 of the MCMC Algorithm [l] (see appendix). Based on these quantities, we can 
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evaluate the evidence for the effect of a continuous covariate being linear or nonlinear, 
whether a model term has a relevant effect at all, which interaction terms are relevant, 
and so on. 



3. EMPIRICAL EVALUATION 

3.1. Simulations 

To evaluate the performance of the peNMIG prior for function selection in generalized 
additive models, we conducted extensive simulations representing scenarios of different 
complexity comprising different response types, sample sizes, signal to noise ratios, corre- 
lated and uncorrected covariates, varying degrees of concurvity, and predictors of either 
high or low sparsity. As competitors, we considered component-wise boosting ( [Hothorn 



etarilZOTol , ACOSSO dStorlie et al.l[20Tol ), as well as SPAM ( |Ravikumar etaTl [20091 , the 



double shrinkage GAM proposed by Marra and Wood (20TTJ, the HGAM approach of 



Meier et al. ( 2009[ > and finally Bayesian additive regression trees ( [Chipman et al. 20101 as 



a non-parametric, "black-box prediction" alternative. As benchmark, we used a conven- 
tional GAM based on the true model structure. Note that ACOSSO, SPAM and HGAM 
are available for Gaussian responses only. Predictive deviance and the ability to recover 
the correct model complexity were used as performance measures. Detailed description 
and graphical summaries of these simulations can be found in Section |C] of the appendix. 
Extensive additional simulation studies are described in the first author's dissertation 



( [Scheipl[[20Tl^ . 



The main conclusion that can be drawn from the simulations is that the proposed ap- 
proach is highly competitive to previous suggestions while having the advantage of being 
applicable both for generalized exponential family regression (while most previous sug- 
gestions are restricted to Gaussian responses) and a much broader class of model terms 
(most previous suggestions are restricted to univariate smooth functions). Estimation 
based on the peNMIG prior typically achieves good estimation performance simultane- 
ously with high accuracy in determining the correct model specification. It is robust 
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against correlated covariates and low signal-to-noise ratios. Prediction accuracy is ro- 
bust against concurvity, however very strong concurvity of course leads to diminished 
selection accuracy. Selection of large coefficient blocks such as random effects for non- 
Gaussian response can be problematic: For both Poisson and binary responses, the selec- 
tion accuracy in this case was very low, albeit without adverse effects on the estimation 
of the coefficients. 

Robustness to hyperparameter settings was investigated by running the simulations 
for combinations of Vq = 0.01,0.005,0.00025 and (at^&t) = (5,25), (5,50), (10,30). Pre- 
diction accuracy was very robust against different hyperparameter configurations in all 
the settings we considered while variable selection and model choice were more sensitive 
to varying hyperparameters, because estimated posterior inclusion probabilities for small 
effects are sensitive to the value for vq. This parameter controls the threshold of rele- 
vance of the model terms: in general, very small Vq means small effects are more likely 
to be included in the model, while larger vq yield more conservative selection properties. 
However, the conducted simulations and the applications presented in the following sec- 
tions provide a solid foundation for the choice of appropriate values for a wide range 
of applied problems. We have successfully fitted all of the application examples and the 
binary classification data discussed in the following section with a default prior that uses 
Vq = 0.00025, {ar,hr) = (5,25) and {ur^h^) = (1,1). 



3.2. Binary Classification Benchmarks 

We use a collection of 21 data sets for binary classification to investigate the performance 
of our approach on some well known benchmarks. The same collection has previously 
been used for benchmarking in Meyer et al. ( |2003| l which contains some more details on 



the datasets that we use. Table[T]gives an overview of the datasets and their characteristics. 
We evaluate prediction performance based on the deviance values for a 20-fold cross 
validation on each dataset. Predictive deviance D is defined as twice the average negative 
log likelihood D = —2/ rip ^{VP.ir VP,i) i^i the test sample where yp and tjp are the out- 
of-sample responses and the posterior mean of the linear predictor for the test sample. 
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data set 


n 


covariates 


of which factors 


balance 


Breast Cancer 


683 


9 


9 


0.54 


Cards 


653 


15 


5 


0.83 


Circle 


1200 


2 





0.97 


Heart 1 


296 


13 


5 


0.85 


HouseVotes84 


232 


16 





0.87 


Ionosphere 


351 


33 





0.56 


PimaDiab 


768 


8 





0.54 


Sonar 


208 


60 





0.87 


Spirals 


1200 


2 





1.00 


chess 


3196 


36 


1 


0.91 


credit 


1000 


24 


9 


0.43 


hepatitis 


80 


19 





0.19 


liver 


345 


6 





0.72 


monksS 


554 


6 


4 


0.92 


musk 


476 


166 





0.77 


promotergene 


106 


57 


57 


1.00 


ringnorm 


1200 


20 





1.00 


threenonn 


1200 


20 





1.00 


tictactoe 


958 


9 


9 


0.53 


titanic 


2201 


3 


1 


0.48 


twonorm 


1200 


20 





1.00 



Table 1: Characteristics of UCI data sets. "Balance" is the ratio between the 
niimber of observations in the larger class and the number of observations 
in the smaller class, i.e. it is 1 if the data set is balanced. 



The size of the test sample is np. As for the experiments with simulated data, we use 
component-wise boosting with separate base learners for the linear and smooth parts of 
covariate influence and compare prediction performance of the boosting models to our 
approach. For boosting, we determine the stopping iteration mgtop based on the out-of- 
bag risk in 10 bootstrap samples of the training sample. 

The second metric we are interested in is the parsimony of the estimated models. We 
simply count the number of model terms (or baselearners for boosting) included in the 
model, i.e., we count the model terms with marginal posterior inclusion probabilities 
greater than 0.5. For boosting, we count a baselearner as included if it was selected 
at least once before iteration Wstop in more than half of the bootstrap samples used to 
determine mstop- 

The data are preprocessed in order to preempt numerical problems: All covariates 
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with less than six unique values are treated as factor variables. All numeric covariates 
are logarithmized if their absolute skewness is larger than two and standardized to have 
mean zero and unit standard deviation. Incomplete observations are removed. All nu- 
meric covariates are associated with both a linear and a smooth effect. For data sets 
credit, Cards, Heartl, Ionosphere, hepatitis. Sonar and musk, we use NMIG in- 
stead of peNMIG for terms with d = 1 (i.e. linear terms and binary factors) to reduce 
the posterior's dimensionality. Estimates are based on samples from eight parallel chains 
with a burn-in of 1000 iterations, followed by a sampling phase of 5000 iterations of 
which we save every fifth. We report results for (at/^t) = (5,25), w ~ Beta(l, 1) and 
vo E {0.005,0.00025}. Results with {ar,^) = (5,50) and w ~ Beta (20, 40) were qualita- 
tively similar and are omitted for clarity of presentation. An unabridged description is in 
Scheipl| ( |20Tlal Ch. 4.2). 



Figure |3] shows the prediction accuracy for the 21 datasets. Most outliers with large 
deviances are due to the sampler getting stuck for some of the parallel chains in specific 
folds for some of the data sets. By rerunning the analysis with different starting values 
or random seeds or manual postprocessing of the posterior samples, these presumably 
could have been avoided. We include them unchanged to provide a more realistic picture 
of the reliability of our approach. Practicioners should always check acceptance rates and 
convergence diagnostics when using MCMC-based methods. Note that the predictive 
performance of our approach is usually more variable than that achieved by m boost but 
has lower median predictive deviances in all of the datasets for all prior specifications 
(Results for (a-t/^t) = (5,50) and w ~ Beta(20,40) not shown). Predictive performance 
is very robust against different hyperparameter settings, even for large p/n ratio where 
the influence of the hyperparameters on the posterior is relatively stronger. To investigate 
the parsimony of the fitted models, i.e. whether equivalent or better prediction can be 
achieved by simpler models, we plot the differences in predictive deviances versus the 
difference in the proportion of potential model terms included in the models in Figure 
|4] (Results for Spirals, tictactoe and titanic not shown because there were no dif- 
ferences in sparsity). Positive values on the vertical axis indicate smaller deviance for 
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Figure 3: UCI data I: Predictive Deviances for 20-fold crossvalidation. Box- 
plots show results for the different prior settings, the horizontal ribbon 
shows results for m boost: shaded region gives IQR, dashed line represents 
median. Dark grey lines connect results for the same fold. Dotted line gives 
predictive deviance of the null model on the full data set. 



our approach, and positive values on the horizontal axis indicate a sparser fit for our 
approach. Figure |4] shows that our approach achieves its relatively more precise predic- 
tions with smaller models on the large majority of the benchmark data sets. The only 
exceptions are datasets chess, where the increased precision is achieved at the cost of less 
sparse models, and, to a much lesser extent, twonorm and threenorm. Neither absolute 
performance nor performance relative to boosting seems to be tied to any of the easily ob- 
servable characteristics of the data sets (i.e. p, n, p/n, or balancedness). No clear picture 
emerges for the differences between the prior specifications: As expected, a smaller value 
for Vq tends to yield larger models, cf. datasets chess , twonorm, Ionosphere , musk, but 
there are counterexamples as well, e.g. threenorm. Both predictive deviance and sparsity 
results are more sensitive towards Vq than towards (at/^t) (Results for (a-t/^t) = (5,50) 
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Deviance and Sparsity Difference (datasets sorted by increasing p/N) 
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Figure 4: UCI data I: Difference in proportion of included model terms 
versus differences in predictive deviances. Positive values denote smaller 
deviances / models for our approach compared to m boost. Results for 
Spirals, titanic and tictactoe not shown because there were no dif- 
ferences in sparsity. 



not shown). Using an informative prior w ~ Beta (20, 40) to enforce model sparsity has 
no appreciable effect and does not influence prediction quality (Results not show^n). 

More generally, the performance of our approach shows that it is very competitive to 
componentwise boosting and that neither relative nor absolute performance deteriorate 
in very high-dimensional problems (cf. results for musk with n = 476 and 332 potential 
model terms, of which 166 are smooth terms.). 



3.3. Case Study: Survival of Surgical Patients with Severe Sepsis 

Data: We use data on the survival of 462 patients with severe sepsis that was collected 
in the intensive care unit of the Department of Surgery at Munich's Grofihadern hospital 
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between March 1, 1993, and February 28, 2005. Hofner et al. ( 2011[ > have previously 
analysed this data set. The follow-up period was 90 days after the beginning of intensive 
care, with one drop-out after 66 days and 179 patients surviving the observation period. 



Models: We use a piecewise exponential model (PEM) ( [Fahrmeir and Tutz 2001 Ch. 9) 



to model the hazard rate \{t,x) of the underlying disease process, i.e. for fixed time 
intervals defined by cutpoints k = (kq = 0, . .,kj = tmax), where tmax is the maximal 
follow-up time, the hazard rate for subject / at time t, Kj_i < t < Kj in the interval is 
given by 

/ L M \ 

\{t,xi) = exp ^0(7) + J2siii)Mi) + E fmiuimii)) + z^oTt 

V ;=1 m=l / 

where gQ{j) represents the baseline hazard rate in interval gi{i), I = 1, ...,L, are 
time-varying effects of covariates Vii{j), fm{uim{j)), j = ^, are nonlinear effects of 
continuous covariates Ui„,{j) and Z;(;)'7 contains linear, parametric effects. All time- 
dependent quantities are assumed to be piecewise constant on the intervals such that e.g. 
go{t) = goii) for all t e {Kj_i,Kj]. The interval borders k = (0,5, 15,25, ... ,85,90) were 
chosen based on the shape of a nonparametric estimate of the marginal hazard rate. The 
likelihood for the PEM is equivalent to that of a Poisson model with (1) one observation 
for each interval for each subject, yielding 2826 pseudo-observations in total, (2) offsets 
Ojj = max(0,min(Ky — Kj_i,ti — ?cy_i)), where tj is the observed time under risk for subject 
i and (3) responses i/y equal to the event indicators Sjj, with 3ij = if subject / survived 
interval j and Sjj = 1 if not. 

Our aim is twofold: We want to (1) estimate a model that allows assessment of the 
influence of each available covariate on the prognosis of patients, accounting for possibly 
time-varying and /or nonlinear effects and (2) use this setting to evaluate the stability 
of the selection and estimation of increasingly complex models on real data. Important 
covariates included in the analysis are for example the age of the patient, the haemoglobin 
concentration, the presence of a fungal infection, different types of operations and the 
Apache II score (a measure for the severity of disease), see Hofner et al. ( 2011| > for a 
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Figure 5: Posterior means of effects with (pointwise) 80% credible intervals. 
Top: Baseline hazard rate and baseline hazard rate plus the time-varying 
and time-constant effects for direct postoperative admission, presence of a 
fungal infection, palliative operation and beginning of treatment after 2002. 
Bottom: smooth effect of haemoglobin concentration and linear effects of 
age (10 year increase) and Apache n score, a measure for disease severity 
(increase of score by 1 standard deviation). 



complete description. 



Full Data Results We perform term selection for a maximal model which includes the 
(linear and non-linear) effects of all 20 covariates as well as their time-varying effects, i.e. 
48 potential model terms with 262 coefficients in total. Hyperparameters were set to the 
default values determined in the simulation studies, i.e. fl-r = 5, &t = 25, Vq = 0.00025 
and a^p = bw = 1- Estimates are based on 8 parallel chains running for 20000 iterations 
each after a burn-in of 500 iterations, with every W^^ iteration saved. We use a first order 
random walk prior for the log-baseline and the time-varying effects to regularize their 
roughness, i.e., we use an intrinsic GMRF prior (on the line) for the piecewise constant 
time-varying quantities (denoted as MRF (Interval) in the following). 

The estimated marginal inclusion probabilities indicate a fairly sparse model, with pos- 
terior marginal inclusion probabilities greater than 0.10 for only 10 terms, as shown in 
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Term 


P(7 = l) 


MRF (Interval) 


1.00 


palliative operation 


0.19 


treatment period 


0.71 


Age, linear 


0.99 


Apache II score, linear 


1.00 


Haemoglobin concentration, smooth 


0.38 


MRF(Interval):direct postoperative admission 


0.28 


MRF(Interval):fungal infection 


1.00 


MRF(Interval) :palliative operation 


0.38 


MRF (Interval) :treatment period 


0.13 



Table 2: Posterior means of marginal inclusion probabilities P(7 = 1) (only 
given for terms with P(7 = 1) > .1). 

Sepsis Survival Data: 
K-M Estimator & Survival Curves from Posterior Predictive 
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Figure 6: Kaplan-Meier estimate of the survival curve for observed data in 
black. Grey overlays are survival curves for 100 replicates of survival time 
vectors generated from the posterior predictive distribution. 

Table |2] The estimated effects for this subset of terms are visualized in Figure |5] To verify 
the suitability of the model, we perform a posterior predictive check and generate 100 
replicates of survival times from the posterior predictive distribution. Figure |6] indicates 
that the fit is satisfactory, although there seems to be a tendency to overestimate survival 
rates until about day 70. 
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Predictive Performance Comparison We subsample the data 20 times to construct inde- 
pendent training data sets with 415 patients each and test data sets with the remaining 
47 patients to evaluate the precision of the resulting predictions and compare predictive 
performance to that of equivalent component-wise boosting models fitted with m boost. 
Results for our approach are based on 8 parallel chains, each running for 5000 iterations 
after 500 iterations of burn-in, with every fifth iteration saved. Component-wise boost- 
ing results are based on a stopping parameter determined by a 25-fold bootstrap of the 
training data, with a maximal iteration number of 1500. 

The previous analysis by Hofner et al. ( 2011| has used expert knowledge to define a set 
of six covariates forced into the model (indicators for presence of malignant primary dis- 
ease, palliative operation and begirming of treatment after 2002, as well as sex, age and 
Apache II score). We compare results for four model specifications of increasing com- 
plexity that suggest themselves: a model with only the main effects of the pre-selected 
covariate set, a model with main effects and time-varying effects for the pre-selected 
covariate set, a model with main effects for all 20 covariates and the model with main 
effects and time-varying effects for all 20 covariates which was applied to the whole data 
set (see above). As in the previous section, main effects for numerical covariates such 
as age were split into linear and non-linear parts. Figure |7] shows the predictive de- 
viances achieved by the different model specifications. Predictive deviance is defined as 
—2 Y^f' ^!7(log(^7) + '^ij) ~ Oij^i exp(;/,y), where i = 1, . . . ,Nt indicates the subjects in 
the test set and j = 1, . . . , indicates the intervals in which individual i was under risk, 
Xj and f]ij are the respective posterior predictive means. For this data set, models with 
higher maximal complexity seem to offer no relevant improvement in terms of predic- 
tion accuracy compared to the simplest model based only on the pre-selected covariate 
set without time-varying effects. Most of the models yield essentially equivalent predic- 
tions. However, it is reassuring to see that the predictive performance of our approach 
is not degraded at all by the specification of vastly over-complex models in a setting 
where the underlying structure seems to be fairly simple. In contrast, prediction accuracy 
for component-wise boosting decreases markedly for the models including time-varying 
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Sepsis Survival: Deviance on test sets 
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Figure 7: Predictive deviances for 20 subsampling test sets for the sepsis 
survival data (lower is better). Grey lines connect results from identical 
folds. 



effects in this setting. 

The stability of the marginal term inclusion probabilities across subsamples is fairly 
good, indicating that the term selection is robust to small changes in the data. All model 
specifications identified essentially the same subset of important effects from the set of 
pre-selected covariates (i.e., indicators for palliative operation and beginning of treatment 
after 2002 and linear effects of age and Apache II score), and also the same time-varying 
effects (i.e., time varying effects for palliative operation and beginning of treatment after 
2002). Figure [s] shows the posterior means of inclusion probabilities P(7 = 1) across 20 
subsampled training data sets for each of the 4 model specifications. 

Additional case studies for geoadditive regression of net rent levels in Munich and an 
additive mixed model for binary responses from a large study on hymenoptera venom 
allergy can be found in Section |D] of the appendix. 
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Figure 8: Posterior means of inclusion probabilities P(7 — 1) across 20 sub- 
sampled training data sets for the 4 model specifications. 
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4. CONCLUSIONS 



In this paper, we have proposed a general Bayesian framework for conducting function se- 
lection in exponential family structured additive regression models. Inspired by stochas- 
tic search variable selection approaches and the general idea of spike-and-slab priors, we 
introduced a non-identifiable multiplicative parameter expansion where selection or des- 
election of coefficient batches (such as parameters representing a spline basis or random 
intercepts) is associated with a scalar scaling factor only. This reparameterization allevi- 
ates the notorious mixing problems that would appear in a naive implementation of our 
prior structure. 

The main advantages of the proposed peNMIG prior structure are (1) its general ap- 
plicability for various types of responses (in particular non-Gaussian responses), (2) the 
availability of a supplementing R package that makes all methods immediately accessi- 
ble and reproducible, and (3) the good performance demonstrated in simulations and 
applications with fairly low sensitivity with respect to hyperparameter settings and sub- 
stantiated in theoretical investigations of the shrinkage properties of peNMIG. The class 
of models that can be fitted in this framework can be extended fairly easily by considering 
other latent Gaussian or latent exponential family models that can be implemented via 
data augmentation. 

A different perspective on stochastic search variable selection approaches is to consider 
them as a possibility for implementing Bayesian model averaging. Since so far most 
applications and implementations of Bayesian model averaging are restricted to linear or 
generalized linear models, our approach offers a necessary extension of Bayesian model 



averaging implementations to a much broader model class. As shown in Section 3.1 it 
offers improved prediction accuracy and allows for a principled inclusion of uncertainty 
about term selection and model structure into inferential statements. 
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COMPUTATIONAL DETAILS 

The approach described and evaluated in this paper is implemented in the R-package 
spikeSlabGAM (|Scheipl|[20TTb|. 
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A. MCMC algorithm 



B. Problems of the Conventional NMIG Prior when Selecting 
Coefficient Blocks 

Previous approaches for Bayesian variable selection have primarily concentrated on selec- 
tion of single coefficients (George and McCuUoch 1993 Ishwaran and Rao 2005[ l or used 



very low dimensional bases for the representation of smooth effects. E.g. Cottet et al. 
( |2008| l use a pseudo-spline representation of their cubic smoothing spline bases with only 
3 to 4 basis functions. In the following, we argue that conventional blockwise Gibbs sam- 
pling is ill suited for updating the state of the Markov chain when sampling from the 
posterior of an NMIG model even for moderately large coefficient blocks. We show that 
mixing for 7y will be very slow for blocks of coefficients )8y with dj ^ 1. We suppress the 
index j in the following. 

The following analysis shows that, even if the blockwise sampler is initially in an ideal 
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Algorithm 1 MCMC sampler for peNMIG 



1 Initialize r^^°\ y(°'> , a-^^^\ zv^°^ and jS^^^ (/S'"' via IWLS for non-Gaussian response) 

2 Compute a(0),^(0),xf 

3 for iterations i = 1, . . . , T do 

4 for blocks b = 1, ... ,ba do 

5 update a['^ from its fed (Gaussian case, see (|5))/ via P-IWLS 

6 set = X diag(blockdiag (1^1,. . .,l<ip 

7 update m^*^ from their fed: P{mf^ = 1|-) = — I = \,...,q 



l+exp(-2?; 



8 for blocks b = 1, . . . ,b^ do 

9 update from its fed (Gaussian case, see (|5|)/ via P-IWLS 

10 for model terms ; = 1, . . . , p do 

11 rescale and a-*' 

12 set Xi') = Xblockdiag(^f',. . .,4')) 

2(0 1. „,r-i 



2^ 



13 update Ti2(0, Tp^W from their fed: tT^^ |- ~ f-^ i^Ur + 1/2, br 

p( W— 1 1 "I / 2(f) \ 

14 update 7p('' from their fed: ^^^^^J^ = z;J/'exp (^^^J 

15 update w'*^ from its fed: 

w(*) I ■ ~ Beta (fl,, + ^1(7)'^ + ^.0(7)'^)) 

16 if y is Gaussian then 

17 update (72^^^ from its fed: a^^*^ | ■ ~ f-^ f fl^2 + n/2, + ^' 



state for switching between the spike and the slab parts of the prior, i.e. a parameter 
constellation so that the full conditional probability P(7 = 1|-) = .5, such a switch is 
very unlikely in subsequent iterations for coefficient vectors with more than a few entries 
given the NMIG prior hierarchy. 

Assume that the sampler starts out in iteration (0) with a parameter configuration of 
(it,bt,vo,w,T?-Q. and /S(o) so that P(7(o) = = -5. We set w = .5. The parameters for 



^(0) 

which P(7 = 1|-) = .5 satisfy the following relations: 



P(7 = z;o|-)" ° 2vo t2 1 " ^' 
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Figure 9: ^(7) as a function of the relative change in ^ for varying d, 7(0)^ 
Inclusion probability in iteration (1) as a function of the ratio between the 
sum of squared coefficients in iteration (1) and (0). Lines in each panel 
correspond to t^^^ equal to the median of its full conditional and the .1- and 
.9-quantiles. Upper row is for 7(oj = 1, lower row for 7(-o) = vq. Columns 
correspond to rf = 1, 5, 20. Fat gray grid lines denote inclusion probability 
= .5 and ratio of coefficient sum of squares = 1 



SO that P(7 = 1|-) > .5 if 



> 



1-Vo 



log(t7o). 



Assuming a given value r'^^y set 



E/^{o) = -T^log(t;o)T2). 



Now 7(0) takes on both values Vq and 1 with equal probability, conditional on all other 
parameters. 



In the following iteration, t^-^^ is drawn from its full conditional T (at + d/2,b, 



d 13,2 

(01 

27(0) 



). Figure 



shows P(7(i) = l\Tl^ylf j6ji)) as a function of Yf ^\i)flf ^]q) for vari- 
ous values of d. The 3 lines in each panel correspond to P(7(i) = 1 |tj^^)/ /^j^) ) for values 
of T^^-^j equal to the median of its full conditional as well as the .1- and .9-quantiles. The 
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lower row in the Figure plots the function for 7(0) = 1, the upper row for 7(0) = ^o- 

So, if we start in this "equilibrium state" we begin iteration (0) with vo,w, r'^^y and ^^qj 
so that -P(7(o) = = 0.5. We then determine ^"(7(1) 7^ 7(o) /^(i)) a function of 

• various values of dim(j8y) = d, 

• 7(0) = 1 and 7(0) = vq, 

• T^^^^ at the .1, .5, .9-quantiles of its conditional distribution given jS^gy 7(0). 

The leftmost column in Figure |9] shows that moving between 7 = 1 and 7 = Cq is 
easy for d = 1: For a large range of realistic values for Yf /^(i)/ I^'^ f^^(o)' moving back to 
7(1) = ^0 from 7(0) = 1 (lower panel) has reasonably large probability, just as moving from 
7(0) = ^0 to 7(1) = 1 (lower panel) is fairly likely for realistic values of E'^/^(o)- 
For d = 5, however, P(7(i) = already resembles a step function. For d = 20, if 
/^fi) / J^(o) smaller than 0.48, the probability of moving from 7(0) = 1 to 

7(1) = ^0 (lower panel) is practically zero for 90% of the values drawn from p(T(^^-)|-)- 
However, draws of j6 that reduce Yf f^^ by more than a factor of 0.48 while 7 = 1 are 
unlikely to occur in real data. It is also extremely unlikely to move back to 7(1) = 1 when 
7(0) = ^0/ unless f^fi) I j6(o) is larger than 2.9. Since the full conditional for /6 is 
very concentrated if 7 = Vq, such moves are highly improbable and correspondingly the 
sampler is unlikely to move away from 7 = cq- Numerical values for the graphs in Figure 
|9] were computed for At = 5, &t = 50, Vq = 0.00025 but similar problems arise for all 
suitable hyperparameter configurations. 

In summary, mixing of the indicator variables 7 will be very slow for long subvectors. 
In experiments, we observed posterior means of P (7 = 1) to be either or 1 across 
a wide variety of settings, even for very long chains, largely depending on the starting 
values of the chains. A multiplicative parameter expansion offers a possible remedy, 
with the added benefit of inducing very desirable shrinkage properties for the resulting 
estimates as shown in the article. 
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C. Simulation Results 



In the following Sections C.l and C.2 we compare the performance of peNMIG in (gen- 



eralized) additive models (GAMs) as implemented in package spikeSlabGAM ( [Scheipl 



2011b| l to that of component-wise boosting ( [Hothorn et al. [2010| in terms of predictive 



MSE and complexity recovery. As a reference, we also fit a conventional GAM (as im- 
plemented in mgcv (Wood 2008[ >) based on the "true" formula (i.e. a model without any 



of the "noise" terms), which we subsequently call the "oracle"-model. For Gaussian re- 
sponses only, we also compare our results to those from ACOSSO ( [Storlie et al. 2010| |. 



ACOSSO is not able to fit non-Gaussian responses. Section C.3 investigates the effects 
of concurvity on effect estimates and term selection for our method and those of some 
recently proposed competitors. 

We supply separate base learners for the linear and smooth parts of covariate influ- 
ence for the component-wise boosting in order to compare complexity recovery between 
boosting and our approach. We use 10-fold cross validation on the training data to deter- 
mine the optimal stopping iteration for m boost and count a baselearner as included in the 
model if it is selected in at least half of the cross-validation runs up to the stopping iter- 
ation. BIC is used to determine the tuning parameter for ACOSSO. We did not compare 
our approach to [Reich et^L[ ([2009), which is implemented for Gaussian responses, since 
the available R implementation is impractically slow - computation times are usually 15 
- 30 times those of our peNMIG implementation. 



For both Gaussian responses (Section C.l| and Poisson responses (Section C.2| , the data 



generating process has the following structure: 
• We define 4 functions 

- /i(^) = 

- /3(x) = —X -\- 7rsin(7rx), 

- fi{x) = 0.5x + 15^(2(x — .2)) — (p{x + 0.4), where (p{) is the standard normal 
density function. 
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which enter into the linear predictor. Note that all of them have (at least) a linear 
component. 

• We define 2 scenarios: 

- a "low sparsity" scenario: Generate 16 covariates, 12 of which have non-zero 
influence. The true linear predictor is — +/2(x2) + fsixs) +/4(x4) + 

1.5(/i(X5) +f2{X6) +f3{X7) +/4(X8)) +2(/i(x9) + flixw) + fsixu) + Mxn))- 

- a "high sparsity" scenario: Generate 20 covariates, only 4 of which have non- 
zero influence and rj = fi{xi) + f2(,X2) + fsixs) + /4(^4)- 

• The covariates are either 

- '-h^- U[-2,2] or 

- from an AR(1) process with correlation p = 0.7. 

• We simulate 50 replications for each combination of the various settings. 
We compare 9 different prior specifications arising from the combination of 

• {ar,br) = (5,25), (10,30), (5,50) 

• 170 = 0.00025,0.005,0.01 

Predictive MSE is evaluated on test data sets with 5000 observations. Complexity recov- 
ery, i.e. how well the different approaches select covariates with true influence on the 
response and remove covariates without true influence on the response is measured in 
terms of accuracy, defined as the number of correctly classified model terms (true posi- 
tives and true negatives) divided by the total number of terms in the model. For example, 
the full model in the "low sparsity" scenario has 32 potential terms under selection (linear 
terms and basis expansions/ smooth terms for each of the 16 covariates), only 21 of which 
are truly non-zero (the linear terms for the first 12 covariates plus the 9 basis expansions 
of the covariates not associated with the linear function /i()). Accuracy in this scenario 
would then be determined as the simi of the correctly included model terms plus the 
correctly excluded model terms, divided by 32. 
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Figure 10: Prediction MSE divided by oracle MSE for Gaussian response. 
Boxplots show results for the different prior settings, horizontal ribbons 
show results for m boost (solid) and ACOSSO (dashed), respectively: Shaded 
region gives IQR, line represents median. Dark grey lines connect results for 
the same replication. Columns from left to right: 200 obs. with SNR=5, 20; 
1000 obs. with SNR=5, 20. Rows from top to bottom: uncorrelated obs. with 
sparse and unsparse predictor, correlated obs. with sparse and unsparse 
predictor. Vertical axis is on binary log scale. 



C.l. Gaussian response 

In addition to the basic structure of the data generating process described at the beginning 
of this section, the data generating process for the Gaussian responses has the foUow^ing 
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Figure 11: Complexity recovery for Gaussiari response: proportiori of cor- 
rectly included and excluded model terms. Boxplots show results for the 
different prior settings, horizontal ribbons show results for m boost (solid) 
and ACOSSO (dashed), respectively: Shaded region gives IQR, line rep- 
resents median. Dark grey lines connect results for the same replication. 
Columns from left to right: 200 obs. with SNR=5, 20; 1000 obs. with SNR=5, 
20. Rows from top to bottom: uncorrelated obs. with sparse and unsparse 
predictor, correlated obs. with sparse and unsparse predictor. 



properties: 

• signal-to-noise-ratio SNR = 5, 20 

• number of observations: n = 200, 1000 
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Figure 10 shows the mean squared prediction error divided by the one achieved by 
the "oracle"-model, a conventional GAM without any of the noise variables. Predictive 
performance is very robust against the different prior settings especially for the settings 
with low sparsity. Different prior settings also behave similarly within replications, as 
shown by the mostly parallel grey lines. Predictions are more precise than those of both 
boosting and ACOSSO, and this improvement in performance relative to the "true" model 
is especially marked for n = 1000 (two rightmost columns). With the exception of the first 
scenario, the median relative prediction MSB is < 2 everywhere, while both boosting and 
ACOSSO have a median relative prediction MSB above 4 in most scenarios that goes up to 
above 32 and 64 for ACOSSO and boosting, respectively, in the "large sample, correlated 
covariates" cases. In the "large sample, low sparsity" scenarios (two leftmost columns in 
rows two and four), the performance of our approach comes very close that of the oracle 
model - the relative prediction MSBs are close to one. 



Figure 11 shows the proportion of correctly included and excluded terms (linear terms 
and basis expansions) in the estimated model. Bxcept for vq = 0.00025, accuracy is con- 
sistently lower than for ACOSSO. However, a direct comparison with ACOSSO is not 
entirely appropriate because ACOSSO does not differentiate between smooth and linear 
terms, while m boost and our approach do. Therefore ACOSSO solves a less difficult prob- 
lem. Bstimated inclusion probabilities are very sensitive to vq and comparatively robust 
against {a-^br). Across all settings, vq = 0.00025 delivers the most precise complexity 
recovery, with sensitivities consistently above 0.7. The accuracy of peNMIG is better than 
m boost for the sparse settings (first and third rows) because the specificity of our approach 
is > .97 across settings, regardless of the prior (!), while mboost mostly achieves only very 
low specificity, but fairly high sensitivity. 



C.2. Poisson response 

In addition to the basic structure of the data generating process described at the beginning 
of this section, the data generating process for the Poisson responses has the following 
properties: 
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Figure 12: Prediction MSE divided by oracle MSE (on the scale of the linear 
predictor). Boxplots show results for the different prior settings. Horizontal 
ribbons show results for m boost: shaded region gives IQR, line represents 
median. Dark grey lines connect resiilts for the same replication. Columns 
from left to right: 500 obs., 2000 obs. Rows from top to bottom: imcorrelated 
obs. with sparse and unsparse predictor, correlated obs. with sparse and 
imsparse predictor. Vertical axis is on binary log scale. 



• niunber of observations: n = 500,2000 



• responses are generated w^ith overdispersion: 
yi ~ Po (siexp(?/i)) ; Sj ~ Lr[0.66, 1.5] 
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Figure 13: Complexity recovery for poisson response: proportion of correctly 
included and excluded model terms. Boxplots show results for the different 
prior settings. Horizontal ribbons show results for m boost: shaded region 
gives IQR, line represents median. Dark grey lines connect results for the 
same replication. Columns from left to right: 500 obs., 2000 obs. Rows 
from top to bottom: uncorrelated obs. with sparse and unsparse predictor, 
correlated obs. with sparse and unsparse predictor. 



We did not use vq = 0.01 for this experimer\t because of its inferior performance in terms 
of complexity recovery in the Gaussian case. 

Figure [12] shows the mean squared prediction error (on the scale of the linear predic- 
tor) divided by the one achieved by the "oracle"-GAM that includes only the relevant 
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covariates and no noise terms. Predictive performance is very robust against the different 
prior settings. Different prior settings also behave similarly within replications, as shown 
by the mostly parallel grey lines. Predictions are more precise than those of m boost, es- 
pecially for smaller data sets (left column) and correlated responses (two bottom rows). 
For the "low sparsity correlated covariates" setting (bottom row), the performance of 
our approach comes fairly close to that of the "oracle"-GAM, with relative prediction er- 
rors mostly between 1 and 1.5, and occasionally even improving on the oracle model for 
n = 500. 



Figure 13 shows the proportion of correctly included and excluded terms (linear terms 
and basis expansions) in the estimated models. Estimated inclusion probabilities are 
sensitive to vq and comparatively robust against (at/^t)- The smaller value for vq tends 
to perform better in the unsparse settings (second and fourth rows) since it forces more 
terms into the model (resulting in higher sensitivity and lower specificity) and vice versa 
for the sparse setting and the larger vq. Complexity recovery is (much) better across the 
different settings and priors for our approach than for boosting. The constant accuracy 
for m boost in the low sparsity scenario with uncorrelated responses (second row) is due 
to its very low specificity: It includes practically all model terms all the time. 

C.3. Gaussian GAM with concurvity 

We use a similar data generating process as the one used in the previous Subsections: 

• We define functions /i(^) to /4(x) as in the data generating process for the previous 
Subsections. 

• We use 10 covariates: The first 4 are associated with functions /i to f^, respectively, 
while X5 to x-[Q are "noise" variables without contribution to the linear predictor. 

• covariates x are ~ LJ[— 2,2] 

• we distinguish 3 scenarios of concurvity: 

- in scenario 1, x^ = c ■ g{x3) + (1 — c) ■ u, i.e., two covariates with real influence 
on the predictor are functionally related. 
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- in scenario 2, X5 = c • g{x4) + (1 — c) ■ w, i.e., a "noise" variable is a noisy version 
of a function of a covariate with direct influence. 

- in scenario 3, X4 = c ■ g{x5) + (1 — c) • u, i.e., a covariate with direct influence is 
a noisy version of a "noise" variable. 

where g{x) = 2^{x,}l = -\,a^ = 0.16) +20(x,^ = i^a^ = 0.09) - ^{x) - 2 with 
cr^) defined as the cdf of the respective Gaussian distribution, i. i. d. standard 
normal variates u, and the parameter c controlling the amount of concurvity: c = 1 
for perfectly deterministic relationship, and c = for independence. In our simula- 
tion, c = 0, .2, .4, .6, .8,1. 



we use signal-to-noise ratio SNR= 1, 5. 



• we simulate 50 replications for each combination of the various settings. 

Predictive MSB is evaluated on test data sets with 5000 observations. We use the default 
prior fl-r = 5, = 25, Vq = 0.00025 for our approach and present results for 12 chains run 
in parallel for 2600 iterations each, discarding the first 100 as burn-in and keeping every 
fifth iteration. We compare predictive MSB and the sensitivity of the selection to various 
degrees of concurvity to results from 

• m boost, as above. 



BART: Bayesian additive regression trees (Chipman et al. 2010| l as implemented in 
R-package BayesTree ( [Chipman and McCulloch |2010[ l, as an example of a non- 
parametric method that does not yield interpretable models. 



hgam: the high-dimensional additive model approach of Meier et al. ( 2009 1, as im- 
plemented in R-package hgam 



spam: the approach for sparse additive models by Ravikumar et al. ( [2009 >, with a 
covariate assumed to be selected as a linear influence if its estimated associated 
degrees of freedom were between 0.1 and 1, and assumed to be selected as a non- 
linear influence if its estimated associated degrees of freedom were > 1. 
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Figure 14: Prediction MSE. Boxplots show results for the different algo- 
rithms. Columns for the three scenarios, top row for signal to noise ratio 
1, bottom row for signal to noise ratio 5. Vertical axis is on binary log scale. 



mgcv-DS: the double shrinkage approach for GAM estimation and term selection 
described in Marra and Wood ( |2011 1, as implemented in R-package mgcv, with a 



covariate assumed to be selected as a linear influence if its estimated associated 
degrees of freedom were between 0.1 and 1, and assumed to be selected as a non- 
linear influence if its estimated associated degrees of freedom were > 1. 

Figure [T4| compares prediction MSEs for the various methods across scenarios and signal- 
to-noise ratios for varying degrees of concurvity. It is clear to see that our approach 
dominates in terms of prediction accuracy in these difficult settings, with BART and 



mboost as fairly close competitors. Figure 15 shows the proportion of correctly selected 



or removed covariates (i.e., "true positives" and "true negatives"). As for prediction ac- 



curacy, our approach outperforms the rest, with the double shrinkage approach of Marra 
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Proportion of true negatives and true positives 
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Figure 15: Selection accuracy as proportion of correctly selected or removed 
predictors. Boxplots show results for the different algorithms. Columns for 
the three scenarios, top row for signal to noise ratio 1, bottom row for signal 
to noise ratio 5. 



and Wood ( |2011[ | a close second for selection accuracy for the noisy setting (but much 



worse in terms of prediction). Figure 15 does not include BART since its implementa- 
tion in BayesTree does not offer clear inclusion or exclusion indicators. The only variable 
importance measure (i.e., how often a given variable was used in nodes across the en- 
semble of trees) returned by BART had roughly the same mean for influential and noise 
variables in all our simulations and would not have yielded a useable picture of the true 



predictor structure. Finally, figure 16 shows that estimated inclusion probabilities are 
fairly unreliable for intermediate to strong concurvity in noisy settings (top row) if both 
variables that are involved have separate effects (left column) - in this scenario, estimated 
inclusion probabilities of the covariate X3, which is a (noisy) function of another covariate 
X4, decrease dramatically for intermediate concurvity and recover somewhat for perfect 
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Figure 16: Posterior inclusion probabilities for X3, X4,X5. Shown are the re- 
spective maxima of the posterior inclusion probabilities for the linear and 
smooth effects of each variable in each replication. Columns for the three 
scenarios, top row for SNR 1, bottom row for SNR 5. Fat grey horizontal line 
denotes P(7 = 1) = 0.5 



curvilinearity, w^here inclusion probabilities for X3 are somewhat reduced. Note hov^ever, 
that if a model including interactions of X3 and X4 is specified in this scenario, our ap- 
proach will often select those. This behavior makes sense if the effects of X3 and X4 cannot 
be clearly separated, as in this case. 

If a variable X5 without true effect is a (noisy) function of another covariate x^ with true ef- 
fect (second column), inclusion probabilities are fairly stable and yield correct inferences 
about the model structure unless the curvilinear relationship is perfect (concurvity= 1), 
at which point it again becomes impossible to disentangle the effect of X4 and X5. Even 
so, using a threshold for inclusion of P(7 = > 0.5 the correct model structure would 
have been identified in 44 out of 50 replicates for low SNR and 17 out of 50 for high SNR. 
For noisy data (top row), the scenario in which a noisy version X4 of a spurious covariate 
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X5 has true effects (third column) is problematic, with inclusion probabilities for ^4 de- 
creasing strongly under intermediate and strong concurvity. Note, however, that even for 
strong concurvity > 0.6, the true model structure was identified at least 24 times out of 
50 replicates for low SNR and at least 41 times out of 50 replicates for high SNR. 



C.4. Summary 

The simulations for generalized additive models show that the proposed peNMIG-Model 
is very competitive in terms of estimation accuracy and confirms that estimation results 
are robust against different hyperparameter configurations even in fairly complex models, 
and under strong concurvity. Model selection is more sensitive towards hyperparameter 
configurations, especially vq. For smaller vq, spikeSlabGAM seems to be able to distinguish 
between important and irrelevant terms fairly reliably. 

The performance of peNMIG as implemented in spikeSlabGAM seems to be very com- 
petitive to that of component- wise boosting as implemented in m boost and clearly domi- 
nates other function selection approaches in our concurvity simulation study. Simulation 
results for an earlier, more rudimentary implementation of the peNMIG model on identi- 
cal data generating processes and for many other settings are published in Scheipl||2010|. 



D. Additional Case Studies 

According to German law, increases in rents for flats can be justified based on "average 
rents" paid for flats that are comparable in size, equipment, quality and location. As a 
consequence, most larger cities publish rental guides that provide such average rents, ob- 
tained from regression models with net rents or net rents per square meter as dependent 
variables. 

Data: We analyze data on approximately 3,000 flats in Munich collected by Infratest 
Sozialforschung for the 2007 rental guide. The original data contain approximately 270 
covariates describing characteristics of the flats as diverse as the quality of bathroom 
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equipment, whether the flat is rented for the first time, the presence and size of a garden 
or a balcony, etc. While most of the covariates are categorical, there are some continu- 
ous covariates of designated interest that are suspected to have nonlinear impact on the 
net rent per square meter. Moreover, a spatial variable is available that represents the 
subdistrict of Munich where the flat is located. 

Model: We will model net rents per square meter with a high-dimensional geoadditive 
regression with Gaussian errors and linear predictor 

267 

Vi = N+fspASi) + J2f{Xij), 

;=i 

where fspat{si) is the spatial effect of subdistrict s, modeled with a Gaussian Markov 
random field (GMRF). The linear predictor additionally contains potential smooth effects 
of the year of construction, floorspace and begin of tenancy as well as 265 other potentially 
influential and mostly categorical covariates Xj. In total, this model has 594 coefficients in 
269 model terms. Term selection is particularly challenging in this scenario because the 
available covariates include a number of redundant and highly collinear covariates such 
as an indicator variable for the presence of at least one balcony, a numeric variable giving 
the size of the flat's balconies in square meters and a count variable giving the number of 
balconies. 

Hyperparameters were set to the default values determined in the simulation studies, 
i.e. At = 5, = 25,1^0 = 0.00025 and aj„ = bu, = 1. Estimates are based on 20 parallel 
chains running for 2500 iterations each after a burn-in of 500 iterations, with every fifth 
iteration saved. 



Results: Table|3]lists the terms with posterior inclusion probability greater than 10%. The 
additive predictor is dominated by the contributions of terms for the presence of balcony, 
the date of beginning of the tenancy, the quality of the residential area the property is 



located in and presence of an attic and presence of a playground. Figures 17 and 18 show 
the estimated effect of subdistrict on net rent per square meter and the continuous predic- 
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Term 


P(7 = l) 


MRF(subdistrict) 


0.86 


Floorspace, linear 


0.95 


Floorspace, smooth 


0.97 


Year, smooth 


0.10 


Begin of Tenancy, linear 


1.00 


Begin of Tenancy, smooth 


0.97 


Quality of Residential Area, cat. 


0.58 


Company Housing, cat. 


0.82 


Has Planted Area, cat. 


0.26 


Refrigerator, cat. 


0.31 


Kitchen Type, cat. 


0.31 


Intercom, cat. 


0.80 


Flooring, cat. 


0.51 


Has Balcony, cat. 


0.48 


Number of Balconies, cat. 


0.21 


Has Terrace, cat. 


0.34 


Number of Terraces, cat. 


0.15 


Has Roof Terrace, cat. 


0.19 


Number of Roof Terraces, cat. 


0.12 


Area of 2nd Balcony, linear 


0.12 


Has Clothes Drying Area, cat. 


0.15 


Has Playground, cat. 


0.62 


Has Attic, cat. 


0.60 


Garden-Use Permission, cat. 


0.14 


Apartment Type, cat. 


0.15 



Table 3: Terms with inclusion probability P{j = 1) > 0.1. 



tor effects, respectively, both combined with associated credible intervals. It general, the 



estimated effects resemble those found in previously published analyses (cf. Kneib et al. 



2011 1 of this data, with lower rents in predominantly working-class outskirts and higher 



rents in fashionable districts in and around the city centre. Including the beginning of 
tenancy (which has not been included in the analyses by Kneib et al. 2011| > seems to some- 
what attenuate the effect of the year of construction of the flat, and the total floorspace 
has a much larger effect on net rent per square meter. 



Cross Validation Performance: We perform a 10-fold crossvalidation to gain some insight 
into the stability of the term selection and to compare the predictive performance of our 
approach with previous modeling efforts described in |Kneib et al. ( |2011 1. While ridge and 
lasso priors to select single dummy variables of specific factor levels have been used in 
their model, our results are based on blockwise selection including or excluding all levels 
of a categorical covariate simultaneously. Models in Kneib et al. ( |2011| > were estimated 
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Effect of Subquarter 



80% credible intervals 




Figure 17: Map of Munich's subdistricts with estimated effect on net 
rent/m^. Left panel shows posterior mean of effects, right panel shows the 
sign of 80% posterior credible intervals: regions with lower net rent in white, 
higher net rent in black, regions with credible intervals overlapping zero in 
gray. Subdistricts without any observations are filled with diagonal lines. 
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Figure 18: Smooth terms with posterior inclusion probability greater than 
10% for Munich rental guide data. Grey ribbons show 80% pointwise credi- 
ble intervals. 



with the software package BayesX ( [Brezger et al. 2005[ |. We also considered "expert" 
models in analogy to Kneib et al. ( [2011 1 including only a strongly reduced number of 
covariates as candidates, estimated both with BayesX and our function selection approach. 
For the latter, we also estimated an expanded "expert" model including all potential two- 
way interactions (except those with the spatial GMRF term). The resulting model has 1051 
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coefficients in 190 model terms including varying coefficient terms and smooth interaction 
surfaces. 

Figure [19] displays the cross validation error on the ten folds for the different model 
specifications. We see that prediction accuracy is not diminished by putting a model 

Munich Rental Guide: Crossvalidation Error 
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Figure 19: Prediction error for 10 cross validation folds for the Munich rental 
guide data. Grey lines connect results from identical folds. 



with only relevant terms under selection, as shown by the equivalent performances of the 
peNMIG "expert"-model and that of the "expert"-model fit with BayesX (first and third 
boxes from the left). This indicates that the relevant effects are estimated without bias 
despite the variable selection prior associated with them, a consequence of the adaptive 
shrinkage properties of the peNMIG prior. As for the sepsis survival data analyzed in the 
subsequent section, there is no noticeable change in prediction performance in most folds 
if a large number of interaction terms are added — compare the similar performances 
of the expert model and of the expert model with additional two-way interactions (first 
and second boxes from the left). The underlying reason is that only a single interaction 
effect, the interaction between the level of kitchen furnishing and level of pollution has 
P(7 = 1) > 0.1 in the expanded "expert" model, but its effect is still very small. The 
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precision of the prediction achieved by our peNMIG approach on the full data set with 
269 potential model terms (fourth box from the left) is very similar to that of Bayesian 
lasso, Bayesian ridge and the conventional NMIG approach (as implemented in BayesX) 
for most folds, even though estimates of the dominating (nonlinear) effects for floorspace, 
year of construction and begin of tenancy and the effect of subquarter were associated 
with a variable selection prior in our model while they where associated with conven- 
tional smoothing priors in BayesX. This reinforces our conclusion that important effects 
are estimated without a selection-induced attenuating bias in our approach. 

Variable selection is very stable across the folds, with the same terms included in all 
ten folds for the "expert" model and the "expert" model with interactions. Only a single 
one of the interactions (between kitchen furnishings and level of pollution), has marginal 
posterior inclusion probabilities above 0.1 in six of the ten folds, the rest is excluded un- 
equivocally in all folds. The model with interactions is more conservative and includes 
less of the main effect terms than the smaller model, because the large number of irrele- 
vant terms moves most of the posterior mass for w, the overall prior inclusion probability, 
towards very low values: the posterior means of w in the smaller model are between 0.71 
and 0.85, while the posterior means for w in the model with interactions are between 0.07 
and 0.09. In the model with all possible covariates, a core set of 23 covariates is identified 
in at least nine out of the ten folds, while nine other covariates have marginal posterior 
inclusion probabilities above 0.1 in at least one fold. Of those nine, two do so in eight of 
the folds. 



D.l. Case Study: Hymenoptera Venom Allergy 

Data We reanalyze data on bee and wasp venom allergy from a large observational 
multicenter study previously analyzed in Rueff et al. ( 2009| |. The data consists of 962 



patients from 14 European study centers with established bee or wasp venom allergy who 
suffered an allergic reaction after being stung. The binary outcome of interest is whether 
patients suffered a severe, life-threatening reaction, defined as anaphylactic shock, loss 
of consciousness, or cardiopulmonary arrest. A severe reaction was observed for 206 of 
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the 962 patients (21.4%). Data were collected on the concentration of tryptase, a potential 
biomarker, patients' sex and age, whether the culprit insect was a bee or wasp, on the 
intake of three types of cardiovascular medication (/5-blockers, ACE inhibitors and anti- 
hypertensive drugs), whether the patient had had at least one minor systemic reaction 
to a sting prior to the index sting and the CAP-class (a measure of antibody load) of the 
patient with regard to the venom of the culprit insect, with levels 0, 1,2,4,5+. 

Models An analysis of this data has to take into account possible study center effects, 
possible non-linear effects of both age and the (logarithm of) blood serum tryptase con- 
centrations and the possibility of differing effect structures for bee and wasp stings. Our 
aim is twofold again: We want to (1) estimate a model that allows assessment of the influ- 
ence of each covariate on the susceptibility for a severe reaction, accounting for possibly 
nonlinear effects and interaction effects and (2) use this setting to evaluate the stability 
of the selection and estimation of increasingly complex models on real data as well as 
investigate the consequences of less-than-optimal sampler convergence we observed. 

Full Data Analysis We fit a peNMIG model with all main effects and all second order in- 
teractions except those with study centre, with smooth functions for both age and tryptase 
and a random intercept for the study center. In total, this model has 267 coefficients in 
66 model terms: 13 main effects including the global intercept, separate linear and non- 
linear terms for age and tryptase and a random intercept for study centre, 21 interactions 
between the seven factor variables, 28 terms for the linear and smooth interactions for age 
and tryptase with each of the seven factors, and four terms for the interaction effect of 
age and tryptase (one linear-linear interaction, two varying coefficient terms, one smooth 
interaction surface) Results are based on samples from 20 chains with 40000 iterations 
each after 1000 burn-in, with every 20*^ saved. Running a single chain of this length on a 
modern desktop computer (i.e., Intel Q9550 2.83GHz) takes about 45 minutes, so that the 
entire fit takes about 4 hours on a quad-core CPU. 

Figure [20| shows the estimated effects of the terms with P(7 = 1) > .1 that are listed in 
Table |4] Since the inclusion probabilities indicate interlocking interactions of cap, tryptase 
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cap*insect*logtr yp stings , sex, ag e 




bee 



Figure 20: Posterior means of effects with (pointwise) 80% credible inter- 
vals. Only effects for terms with marginal inclusion probability > .1 are 
shown. Since there is some evidence for interlocking interactions of cap 
class, tryptase and culprit insect (c.f. Table |4|, the left graph shows the joint 
effect of these 3 variables. The graph on the right shows the relative effects 
of previous stings (compared to none before the index sting), female gender 
(compared to male) and an increase in the patient's age by 10 years. 

and culprit insect, the panels in the left graph in the figure show the joint effects of these 
3 variables. Each panel shows the effect estimate of tryptase plasma concentration for 
bee patients (dark grey) and wasp patients (light grey) for the given CAP class. The rug 
plot at the bottom indicates the locations of the data. The large uncertainty precludes 
a detailed interpretation of this 3-way interaction, but in general, the risk is higher for 
wasp patients: the main effect of culprit insect yields an odds ratio of 1.16 (80%CI: 1- 
2.43) and the increase in risk in wasp patients seems to be smaller for lower and larger 
for higher tryptase concentrations. The graph on the right shows the relative effects of 
previous stings (compared to none before the index sting), female gender (compared to 
male) and an increase in the patient's age by 10 years. Estimated random effects for the 
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Term 


P(7 = l) 


culprit insect 


0.16 


stings 


1.00 


sex 


0.70 


age, linear 


1.00 


tryptase, log-linear 


1.00 


study centre 


0.71 


insect:tryptase, smooth 


0.46 


cap:tryptase, log-linear 


0.14 



Table 4: Posterior means of marginal inclusion probabilities P(7 = 1) (only 
given for terms with P{'y = 1) > .1). 

study centres are not shown, their associated posterior mean odds ratios range between 
0.44 and 2.13. 



Lack of Convergence for 7 For this fairly complicated model, we experience some difficul- 
ties with the convergence of the MCMC sampler: We observe poor mixing for some of the 
entries in 7, with chains getting stuck in basins of attraction around posterior modes for 
long periods of time. This leads to posterior inclusion probabilities for single chains often 
ending up either close to zero or close to one for some of the terms. Running a large num- 
ber of parallel chains from random starting configurations seems to remedy this problem. 
To investigate this issue, we perform a large MCMC experiment with 800 chains, each 



with 10000 iterations after 100 burnin, for the model described above. Figure 21 shows 
the average inclusion probabilites for the 16 terms with the highest between-chain vari- 
ability of P(7 = 1) for 20 fits with 40 chains each. Grey lines connect posterior means 
based on an increasing number of chains for each fit. The black horizontal line shows the 
mean over all 800 chains, which we presume to be a good estimate of the "true" marginal 
posterior inclusion probability. Convergence of the posterior means is slow for these 
terms, but discrimination between important, intermediate and negligible effects seems 
to be reliable based on as few as 10 to 20 chains. While we would not be comfortable in 
claiming that 10 or 20 parallel chains are enough to completely explore this very high- 
dimensional model space and yield a reliable estimate of posterior model probabilities, 
i.e., the joint distribution of 7, the marginal inclusion probabilities P(7y = 1), / = 1, . . . , p 
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Figure 21: Average inclusion probabilites for those terms with convergence 
issues for 20 fits with 40 chains each. Grey lines connect posterior means 
over an increasing number of chains for each fit. Black horizontal line shows 
the mean over all 800 chains. 



of the various terms seem to be estimated v^ell enough to distinguish between important, 
intermediate and negligible effects, w^hich is usually all that is required in practice. This 
conclusion is also borne out by the UCI benchmark study in the main article. 



Predictive Performance Comparison We subsample the data 20 times to construct inde- 
pendent training data sets with 866 subjects each and test data sets with the remaining 
96 patients to evaluate the precision of the resulting predictions and compare predictive 
performance to that of equivalent component- wise boosting models fitted with m boost 
and an unregularized GAMM-fit with all main effects estimated with gamm4. Results for 
our approach are based on 8 parallel chains each running for 10000 iterations after 500 
iterations of burn-in, with every 10^^ iteration saved. Component-wise boosting results 
are based on a stopping parameter determined by a 25-fold bootstrap of the training data, 
with a maximal iteration nimiber of 500. We compare three model specification of increas- 
ing complexity: a simple model with main effects only, a model with main effects and 
all interactions between culprit insect and the other covariates, and the complex model 
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Hymenoptera Venom Allergy: AUG on test sets 
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Figure 22: Area under the ROC curve for 20 test sets from the hymenoptera 
venom allergy data set, higher is better. Grey lines connect results from 
identical folds. 



with all main effects and second order interactions presented before. We were unable to 
fit the latter model with m boost, and the model including the insect interactions could 
not be fitted by m boost for 4 of the training data sets. We report results for the 16 sets 



remaining. Figure 22 shows the area under the ROC cuve (AUC) achieved by the different 



model specifications. For this data set, the models with higher maximal complexity show 
slight decreases in predictive accuracy, but still perform better than an unregularized 
generalized additive mixed model (GAMM) on the far right. 

Despite the fairly low number of parallel chains and comparatively short chain lengths, 
the stability of the marginal term inclusion probabilities across subsamples is fairly good, 
indicating that the term selection is robust to small changes in the data and that even as 
few as 8 chains may be enough to reach fairly reliable rough estimates of term importance 
in this difficult setting. All model specifications identified the same subset of important 
main effects (i.e., number of previous stings before the index sting, sex, linear effects of 



age and the log of tryptase and the random effect for study centre). Figure 23 shows the 
posterior means of inclusion probabilities P(7 = 1) across 16 subsampled training data 
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Figure 23: Posterior means of inclusion probabilities P(7 = 1) across 16 
subsampled training data sets for the 3 model specifications. 
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sets for each of the 3 model specifications. 
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